Setup Completed!
| ID Earthquake | Flag Tsunami | Year | Month | Day | Focal Depth | EQ Primary | Mw Magnitude | Ms Magnitude | Mb Magnitude | Ml Magnitude | ... | Total Effects : Missing Description | Total Effects : Injuries | Total Effects : Injuries Description | Total Effects : Damages in million Dollars | Total Effects : Damage Description | Total Effects : Houses Destroyed | Total Effects : Houses Destroyed Description | Total Effects : Houses Damaged | Total Effects : Houses Damaged Description | Coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | NaN | 334 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 41.2, 19.3 |
| 84 | Tsunami | 344 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | SEVERE (~>$5 to $24 million) | NaN | NaN | NaN | NaN | 40.3, 26.5 |
| 9989 | Tsunami | 346 | NaN | NaN | NaN | 6.8 | NaN | 6.8 | NaN | NaN | ... | NaN | NaN | NaN | NaN | MODERATE (~$1 to $5 million) | NaN | Many (~101 to 1000 houses) | NaN | NaN | 41.4, 19.4 |
| 110 | NaN | 438 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 35.5, 25.5 |
| 9971 | Tsunami | 557 | NaN | NaN | NaN | 7.0 | NaN | 7.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.9, 27.6 |
5 rows × 42 columns
Number of rows: 6208 Number of columns: 42 List of all columns: ['Flag Tsunami', 'Year', 'Month', 'Day', 'Focal Depth', 'EQ Primary', 'Mw Magnitude', 'Ms Magnitude', 'Mb Magnitude', 'Ml Magnitude', 'MFA Magnitude', 'Unknown Magnitude', 'Intensity', 'Country', 'State', 'Location name', 'Region code', 'Earthquake : Deaths', 'Earthquake : Deaths Description', 'Earthquake : Missing', 'Earthquake : Missing Description', 'Earthquake : Injuries', 'Earthquake : Injuries Description', 'Earthquake : Damage (in M$)', 'Earthquake : Damage Description', 'Earthquakes : Houses destroyed', 'Earthquakes : Houses destroyed Description', 'Earthquakes : Houses damaged', 'Earthquakes : Houses damaged Description', 'Total Effects : Deaths', 'Total Effects : Deaths Description', 'Total Effects : Missing', 'Total Effects : Missing Description', 'Total Effects : Injuries', 'Total Effects : Injuries Description', 'Total Effects : Damages in million Dollars', 'Total Effects : Damage Description', 'Total Effects : Houses Destroyed', 'Total Effects : Houses Destroyed Description', 'Total Effects : Houses Damaged', 'Total Effects : Houses Damaged Description', 'Coordinates']
The Significant Earthquake Database is a global listing of over 6,200 earthquakes from 2150 BC to the present. The dataset can be found on the OpenDatasoft website.
A significant earthquake is classified as one that meets at least one of the following criteria: caused deaths, caused moderate damage (approximately 1 million dollars or more), magnitude 7.5 or greater, Modified Mercalli Intensity (MMI) X or greater, or the earthquake generated a tsunami. The database provides information on the date and time of occurrence, latitude and longitude, focal depth, magnitude, maximum MMI intensity, and socio-economic data such as the total number of casualties, injuries, houses destroyed, and houses damaged, along with dollar damage estimates. References, political geography, and additional comments are also provided for each earthquake. If the earthquake was associated with a tsunami or volcanic eruption, it is flagged and linked to the related tsunami event or significant volcanic eruption.
The magnitude is a measure of seismic energy. The magnitude scale is logarithmic: an increase of one in magnitude represents a tenfold increase in the recorded wave amplitude. However, the energy release associated with an increase of one in magnitude is not tenfold, but about thirtyfold. For example, approximately 900 times (30 × 30) more energy is released in an earthquake of magnitude 7 than in an earthquake of magnitude 5. Each increase in magnitude of one unit is equivalent to an increase of seismic energy of about 1.6 × 10^13 ergs. All magnitudes have valid values between 0 and 10.
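The thirtyfold figure can be checked with a short calculation. The sketch below assumes the commonly used Gutenberg-Richter energy relation, log10(E) = 1.5 · M + const, which is not stated in the dataset description; `energy_ratio` is a hypothetical helper name introduced only for this illustration.

```python
def energy_ratio(m_small: float, m_large: float) -> float:
    """Ratio of radiated seismic energy between two magnitudes.

    Assumes log10(E) = 1.5 * M + const (an assumption, not taken from the
    dataset documentation); the constant cancels out in the ratio.
    """
    return 10 ** (1.5 * (m_large - m_small))


one_unit = energy_ratio(6.0, 7.0)   # roughly thirtyfold
two_units = energy_ratio(5.0, 7.0)  # same order as the ~900 quoted above
print(round(one_unit, 1), round(two_units))
```

Under this relation a two-unit difference gives a factor of 1000 rather than 900; the figure of 900 follows from squaring the rounded value 30.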
Feature engineering will be performed for each of these column groups separately. First we will decide which columns to keep, count the null values, and decide what to do with them. We will also consider whether some features need to be combined for better further analysis.
| ID Earthquake | Focal Depth | EQ Primary | Intensity | Flag Tsunami |
|---|---|---|---|---|
| 10245 | 26.0 | 6.9 | 9.0 | Tsunami |
| 10267 | 39.0 | 7.1 | 9.0 | NaN |
| 10367 | 10.0 | 5.3 | NaN | NaN |
| 10430 | 10.0 | 3.8 | NaN | NaN |
| 10515 | 10.0 | 6.6 | 7.0 | NaN |
<class 'pandas.core.frame.DataFrame'> Int64Index: 6208 entries, 78 to 10515 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Focal Depth 3243 non-null float64 1 EQ Primary 4416 non-null float64 2 Intensity 2826 non-null float64 3 Flag Tsunami 1838 non-null object dtypes: float64(3), object(1) memory usage: 242.5+ KB
| | Focal Depth | EQ Primary | Intensity |
|---|---|---|---|
| count | 3243.000000 | 4416.000000 | 2826.000000 |
| mean | 41.064755 | 6.458084 | 8.283439 |
| std | 70.317966 | 1.045100 | 1.825092 |
| min | 0.000000 | 1.600000 | 2.000000 |
| 25% | 10.000000 | 5.700000 | 7.000000 |
| 50% | 25.000000 | 6.500000 | 8.000000 |
| 75% | 40.000000 | 7.300000 | 10.000000 |
| max | 675.000000 | 9.500000 | 12.000000 |
[Figure: distribution of earthquake magnitude; y-axis: Frequency]
From the distribution plot we can see that earthquake magnitude looks approximately normally distributed, with a mean of around 6.5.
[Figure: distribution of focal depth; y-axis: Frequency]
Here we can see that most earthquakes have a focal depth of less than 100 km. There are also some outliers with a focal depth of more than 200 km.
Tsunami 1838 Name: Flag Tsunami, dtype: int64
We can see that if a tsunami occurred the flag is set to Tsunami, and otherwise it is left null. We will convert this column to a boolean type with values True and False.
False 4370 True 1838 Name: Flag Tsunami, dtype: int64
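The conversion described above can be sketched in a few lines. This is a minimal example on a toy frame, not the notebook's own code; it assumes the flag column holds the string "Tsunami" or NaN, as the value counts suggest.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column: "Tsunami" when one occurred, NaN otherwise.
df = pd.DataFrame({"Flag Tsunami": ["Tsunami", np.nan, "Tsunami", np.nan]})

# notna() is True exactly where the flag is set, which gives a boolean column.
df["Flag Tsunami"] = df["Flag Tsunami"].notna()

print(df["Flag Tsunami"].value_counts())
```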
| ID Earthquake | Coordinates | Country | State | Location name | Region code |
|---|---|---|---|---|---|
| 10245 | 5.504, 125.066 | PHILIPPINES | NaN | PHILIPPINES: SARANGANI | 170.0 |
| 10267 | 18.339, -98.68 | MEXICO | NaN | MEXICO: MEXICO CITY, MORELOS, PUEBLA | 150.0 |
| 10367 | 26.374, 90.165 | INDIA | NaN | INDIA: WEST BENGAL | 60.0 |
| 10430 | 20.0, 72.9 | INDIA | NaN | INDIA: MAHARASHTRA: PALGHAR | 60.0 |
| 10515 | 12.021, 124.123 | PHILIPPINES | NaN | PHILIPPINES: MASBATE | 170.0 |
<class 'pandas.core.frame.DataFrame'> Int64Index: 6208 entries, 78 to 10515 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Coordinates 6153 non-null object 1 Country 6208 non-null object 2 State 323 non-null object 3 Location name 6207 non-null object 4 Region code 6207 non-null float64 dtypes: float64(1), object(4) memory usage: 291.0+ KB
We can see that the columns in this group have few null values, except for State. Let's see what values that column holds and how useful they will be for further analysis.
State value counts: CA 103 AK 82 HI 15 PR 14 GU 14 NV 9 NY 8 TAS 8 UT 8 VI 7 BC 6 OK 6 WA 6 MT 5 MP 4 MO 3 PA 3 WY 2 AR 2 KY 2 ID 2 MA 2 CO 2 OR 2 TX 1 CT 1 NC 1 AL 1 VA 1 NH 1 IL 1 SC 1 Name: State, dtype: int64
These appear to be the states in which earthquakes happened in the United States. To test this hypothesis we will check which values the Country column takes when the State field is not null.
USA 267 USA TERRITORY 39 AUSTRALIA 8 CANADA 6 MEXICO 1 BERING SEA 1 GHANA 1 Name: Country, dtype: int64
We can see that the hypothesis is mostly true: 306 of the 323 rows with a state belong to the USA or a USA territory. Because of that, this column is only useful when analysing earthquakes in the USA and will mostly be ignored in further analysis.
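A check of this kind might look like the sketch below. The rows are made up for illustration; only the column names `Country` and `State` come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows: State is filled only for some countries.
df = pd.DataFrame({
    "Country": ["USA", "USA", "CANADA", "JAPAN", "CHILE"],
    "State": ["CA", "AK", "BC", np.nan, np.nan],
})

# Count countries among the rows where State is present.
countries_with_state = df.loc[df["State"].notna(), "Country"].value_counts()
print(countries_with_state)
```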
Region code value counts: 10.0 75 15.0 107 20.0 4 30.0 1045 40.0 305 50.0 120 60.0 472 70.0 13 80.0 1 90.0 165 100.0 169 110.0 50 120.0 127 130.0 846 140.0 810 150.0 490 160.0 600 170.0 808 Name: Region code, dtype: int64
<class 'pandas.core.frame.DataFrame'> Int64Index: 6208 entries, 78 to 10515 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 6208 non-null int64 1 Month 5800 non-null float64 2 Day 5646 non-null float64 dtypes: float64(2), int64(1) memory usage: 194.0 KB
As we can see in the chart above, the dataset contains many more earthquakes from recent years. Of course, this is because we now have more advanced technology to measure earthquakes. This is the main reason to drop some older years from the dataset.
We can conclude that the month does not influence the number of earthquakes, as no general pattern can be extracted from the chart above.
| ID Earthquake | Mw Magnitude | Ms Magnitude | Mb Magnitude | Ml Magnitude | MFA Magnitude | Unknown Magnitude |
|---|---|---|---|---|---|---|
| 10245 | 6.9 | NaN | NaN | NaN | NaN | NaN |
| 10267 | 7.1 | NaN | NaN | NaN | NaN | NaN |
| 10367 | 5.3 | NaN | NaN | NaN | NaN | NaN |
| 10430 | NaN | NaN | NaN | NaN | NaN | 3.8 |
| 10515 | 6.6 | NaN | NaN | NaN | NaN | NaN |
<class 'pandas.core.frame.DataFrame'> Int64Index: 6208 entries, 78 to 10515 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Mw Magnitude 1334 non-null float64 1 Ms Magnitude 2930 non-null float64 2 Mb Magnitude 1804 non-null float64 3 Ml Magnitude 184 non-null float64 4 MFA Magnitude 14 non-null float64 5 Unknown Magnitude 777 non-null float64 dtypes: float64(6) memory usage: 339.5 KB
| | Mw Magnitude | Ms Magnitude | Mb Magnitude | Ml Magnitude | MFA Magnitude | Unknown Magnitude |
|---|---|---|---|---|---|---|
| count | 1334.000000 | 2930.000000 | 1804.000000 | 184.000000 | 14.000000 | 777.000000 |
| mean | 6.513193 | 6.574198 | 5.792572 | 5.395109 | 6.771429 | 6.652638 |
| std | 0.928359 | 0.990792 | 0.724433 | 1.087850 | 1.230027 | 1.007854 |
| min | 3.600000 | 2.100000 | 2.100000 | 1.600000 | 4.300000 | 3.200000 |
| 25% | 5.800000 | 5.800000 | 5.300000 | 4.775000 | 6.225000 | 6.000000 |
| 50% | 6.500000 | 6.600000 | 5.800000 | 5.450000 | 7.050000 | 6.800000 |
| 75% | 7.200000 | 7.300000 | 6.300000 | 6.025000 | 7.475000 | 7.500000 |
| max | 9.500000 | 9.100000 | 8.200000 | 7.700000 | 8.500000 | 8.800000 |
As we can see from the heatmap, correlations between the different magnitude scales are very high, and their distributions do not differ much. Let's see which magnitude is usually taken as the EQ Primary measure. Whichever scale is used most, the high correlations mean that for further analysis it will probably be enough to look only at EQ Primary.
All EQ Primary measures are taken from magnitudes if they are not null!
So we can claim that EQ Primary is just inferred from other magnitudes.
From the last plot we can see that the Ms, Mw and Unknown magnitudes are used most often. All of them are very highly correlated pairwise (> 0.94), so it is safe enough to use only EQ Primary in later analysis.
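One way to test whether EQ Primary is always copied from a scale-specific column is sketched below. It is an illustration on toy rows, not the notebook's own check, and uses only a subset of the magnitude columns for brevity.

```python
import numpy as np
import pandas as pd

# Subset of the magnitude columns, for brevity.
magnitude_cols = ["Mw Magnitude", "Ms Magnitude", "Unknown Magnitude"]

# Toy rows: each primary value is copied from one scale; the last row has none.
df = pd.DataFrame({
    "EQ Primary": [6.9, 7.1, 3.8, np.nan],
    "Mw Magnitude": [6.9, np.nan, np.nan, np.nan],
    "Ms Magnitude": [np.nan, 7.1, np.nan, np.nan],
    "Unknown Magnitude": [np.nan, np.nan, 3.8, np.nan],
})

# Compare every magnitude column with EQ Primary, row by row.
equal_to_primary = df[magnitude_cols].eq(df["EQ Primary"], axis=0)

# Only rows that have a primary magnitude need to match some scale.
has_primary = df["EQ Primary"].notna()
all_inferred = bool(equal_to_primary.any(axis=1)[has_primary].all())

# How often each scale supplies the primary value.
source_counts = equal_to_primary.sum()
print(all_inferred, source_counts.to_dict())
```

Comparisons against NaN are always False, so rows without a value in a given scale never count as a match for that scale.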
| ID Earthquake | Earthquake : Deaths | Earthquake : Missing | Earthquake : Injuries | Earthquake : Damage (in M$) | Earthquake : Houses Destroyed | Earthquake : Houses Damaged | Earthquake : Deaths Description | Earthquake : Missing Description | Earthquake : Injuries Description | Earthquake : Damage Description | Earthquake : Houses Destroyed Description | Earthquake : Houses Damaged Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10245 | NaN | NaN | 5.0 | NaN | 1.0 | NaN | NaN | NaN | Few (~1 to 50 deaths) | LIMITED (roughly corresponding to less than $1... | Few (~1 to 50 houses) | Few (~1 to 50 houses) |
| 10267 | 369.0 | NaN | 6000.0 | 8000.000 | 226.0 | 184000.0 | Many (~101 to 1000 deaths) | NaN | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | Many (~101 to 1000 houses) | Very Many (~1001 or more houses) |
| 10367 | 1.0 | NaN | NaN | NaN | NaN | NaN | Few (~1 to 50 deaths) | NaN | NaN | NaN | NaN | NaN |
| 10430 | 1.0 | NaN | NaN | NaN | NaN | NaN | Few (~1 to 50 deaths) | NaN | Few (~1 to 50 deaths) | LIMITED (roughly corresponding to less than $1... | NaN | Few (~1 to 50 houses) |
| 10515 | 1.0 | NaN | 51.0 | 0.565 | 51.0 | 453.0 | Few (~1 to 50 deaths) | NaN | Some (~51 to 100 deaths) | LIMITED (roughly corresponding to less than $1... | Some (~51 to 100 houses) | Many (~101 to 1000 houses) |
<class 'pandas.core.frame.DataFrame'> Int64Index: 6208 entries, 78 to 10515 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Earthquake : Deaths 2069 non-null float64 1 Earthquake : Missing 21 non-null float64 2 Earthquake : Injuries 1244 non-null float64 3 Earthquake : Damage (in M$) 511 non-null float64 4 Earthquake : Houses Destroyed 786 non-null float64 5 Earthquake : Houses Damaged 490 non-null float64 6 Earthquake : Deaths Description 2551 non-null object 7 Earthquake : Missing Description 21 non-null object 8 Earthquake : Injuries Description 1432 non-null object 9 Earthquake : Damage Description 4446 non-null object 10 Earthquake : Houses Destroyed Description 1704 non-null object 11 Earthquake : Houses Damaged Description 940 non-null object dtypes: float64(6), object(6) memory usage: 759.5+ KB
| | Earthquake : Deaths | Earthquake : Missing | Earthquake : Injuries | Earthquake : Damage (in M$) | Earthquake : Houses Destroyed | Earthquake : Houses Damaged |
|---|---|---|---|---|---|---|
| count | 2069.000000 | 21.000000 | 1244.000000 | 511.000000 | 7.860000e+02 | 4.900000e+02 |
| mean | 3748.109715 | 2182.761905 | 2173.584405 | 1252.089894 | 1.762395e+04 | 2.513319e+04 |
| std | 25333.721982 | 9463.737100 | 26271.351391 | 6733.908952 | 1.971549e+05 | 2.496489e+05 |
| min | 1.000000 | 1.000000 | 1.000000 | 0.013000 | 1.000000e+00 | 1.000000e+00 |
| 25% | 3.000000 | 5.000000 | 10.000000 | 3.950000 | 6.425000e+01 | 9.000000e+01 |
| 50% | 22.000000 | 21.000000 | 40.000000 | 22.000000 | 5.060000e+02 | 6.605000e+02 |
| 75% | 305.000000 | 114.000000 | 200.000000 | 200.000000 | 4.000000e+03 | 3.465250e+03 |
| max | 830000.000000 | 43476.000000 | 799000.000000 | 100000.000000 | 5.360000e+06 | 5.360000e+06 |
From these statistics we can see that this group has a lot of null values.
It is also worth noticing that the maximum number of deaths from earthquake effects is 830,000 people.
Let's see which earthquake caused this many deaths.
Flag Tsunami False Year 1556 Month 1 Day 23 Focal Depth NaN EQ Primary 8.0 Mw Magnitude NaN Ms Magnitude 8.0 Mb Magnitude NaN Ml Magnitude NaN MFA Magnitude NaN Unknown Magnitude NaN Intensity 11.0 Country CHINA State NaN Location name CHINA: SHAANXI PROVINCE Region code 30 Earthquake : Deaths 830000.0 Earthquake : Deaths Description Very Many (~1001 or more deaths) Earthquake : Missing NaN Earthquake : Missing Description NaN Earthquake : Injuries NaN Earthquake : Injuries Description NaN Earthquake : Damage (in M$) NaN Earthquake : Damage Description EXTREME (~$25 million or more) Earthquake : Houses Destroyed NaN Earthquake : Houses Destroyed Description NaN Earthquake : Houses Damaged NaN Earthquake : Houses Damaged Description NaN Total Effects : Deaths 830000.0 Total Effects : Deaths Description Very Many (~1001 or more deaths) Total Effects : Missing NaN Total Effects : Missing Description NaN Total Effects : Injuries NaN Total Effects : Injuries Description NaN Total Effects : Damages in million Dollars NaN Total Effects : Damage Description EXTREME (~$25 million or more) Total Effects : Houses Destroyed NaN Total Effects : Houses Destroyed Description NaN Total Effects : Houses Damaged NaN Total Effects : Houses Damaged Description NaN Latitude 34.5 Longitude 109.7 Name: 732, dtype: object
Plotting univariate distributions
From these distributions we can see, because of the scale, that all of them have some outliers with very large values. These outliers are probably the earthquakes with the most damage, deaths and other disastrous effects in history. We will investigate that further later, but for now it is important to notice them.
From the correlation matrix we can see that the numbers of deaths and missing people are perfectly correlated. Combined with the fact that there are only 21 non-null values for missing people, this column will not be particularly useful.
The numbers of houses damaged and destroyed are also perfectly correlated, so it will probably be enough to retain just one of these columns.
Besides that, we can see that the number of injuries is highly correlated with the numbers of damaged and destroyed houses.
We can see that, except for damaged and destroyed houses, earthquakes with more severe effects are less frequent than those with moderate and small effects. These columns contain two None values whose meaning is not yet clear. These distributions also do not take null values into account (there are a lot of them), so this analysis will be more meaningful when we use only newer data rather than the whole history. One of the questions then is to determine the meaning of the null values.
We can see that the column names for total effects and earthquake effects are the same. Because of that, the first important question here is how much the values differ in corresponding columns.
| ID Earthquake | Total Effects : Deaths | Total Effects : Missing | Total Effects : Injuries | Total Effects : Damage (in M$) | Total Effects : Houses Destroyed | Total Effects : Houses Damaged | Total Effects : Deaths Description | Total Effects : Missing Description | Total Effects : Injuries Description | Total Effects : Damage Description | Total Effects : Houses Destroyed Description | Total Effects : Houses Damaged Description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10245 | NaN | NaN | 7.0 | NaN | 1.0 | NaN | NaN | NaN | Few (~1 to 50 deaths) | LIMITED (roughly corresponding to less than $1... | Few (~1 to 50 houses) | Few (~1 to 50 houses) |
| 10267 | 369.0 | NaN | 6000.0 | 8000.000 | 226.0 | 184000.0 | Many (~101 to 1000 deaths) | NaN | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | Many (~101 to 1000 houses) | Very Many (~1001 or more houses) |
| 10367 | 1.0 | NaN | NaN | NaN | NaN | NaN | Few (~1 to 50 deaths) | NaN | NaN | NaN | NaN | NaN |
| 10430 | 1.0 | NaN | NaN | NaN | NaN | NaN | Few (~1 to 50 deaths) | NaN | Few (~1 to 50 deaths) | LIMITED (roughly corresponding to less than $1... | NaN | Few (~1 to 50 houses) |
| 10515 | 1.0 | NaN | 51.0 | 0.565 | 51.0 | 453.0 | Few (~1 to 50 deaths) | NaN | Some (~51 to 100 deaths) | LIMITED (roughly corresponding to less than $1... | Some (~51 to 100 houses) | Many (~101 to 1000 houses) |
<class 'pandas.core.frame.DataFrame'> Int64Index: 6208 entries, 78 to 10515 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Total Effects : Deaths 1702 non-null float64 1 Total Effects : Missing 25 non-null float64 2 Total Effects : Injuries 1259 non-null float64 3 Total Effects : Damage (in M$) 456 non-null float64 4 Total Effects : Houses Destroyed 817 non-null float64 5 Total Effects : Houses Damaged 428 non-null float64 6 Total Effects : Deaths Description 2040 non-null object 7 Total Effects : Missing Description 26 non-null object 8 Total Effects : Injuries Description 1441 non-null object 9 Total Effects : Damage Description 3293 non-null object 10 Total Effects : Houses Destroyed Description 1784 non-null object 11 Total Effects : Houses Damaged Description 821 non-null object dtypes: float64(6), object(6) memory usage: 759.5+ KB
| | Total Effects : Deaths | Total Effects : Missing | Total Effects : Injuries | Total Effects : Damage (in M$) | Total Effects : Houses Destroyed | Total Effects : Houses Damaged |
|---|---|---|---|---|---|---|
| count | 1702.000000 | 25.00000 | 1259.000000 | 456.000000 | 8.170000e+02 | 4.280000e+02 |
| mean | 4228.737368 | 1910.68000 | 2379.995234 | 1892.290559 | 1.819171e+04 | 5.882836e+04 |
| std | 28267.559410 | 8667.79685 | 27424.400752 | 12469.580794 | 1.950541e+05 | 1.015323e+06 |
| min | 1.000000 | 1.00000 | 1.000000 | 0.010000 | 1.000000e+00 | 1.000000e+00 |
| 25% | 3.000000 | 5.00000 | 10.000000 | 4.460000 | 6.100000e+01 | 9.000000e+01 |
| 50% | 20.000000 | 21.00000 | 40.000000 | 29.000000 | 5.000000e+02 | 6.465000e+02 |
| 75% | 289.500000 | 138.00000 | 200.000000 | 292.500000 | 3.600000e+03 | 2.850000e+03 |
| max | 830000.000000 | 43476.00000 | 799000.000000 | 220085.456000 | 5.360000e+06 | 2.100000e+07 |
Even visually we can see that this matrix appears symmetric about its antidiagonal. Correlation values between corresponding columns (earthquake and total effects) are:
So, except for the Damage (in M$) column, the pairs of columns are very highly correlated, and there is no point in analysing earthquake and total effects separately for the numerical columns.
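The pairwise comparison of corresponding columns can be sketched as follows. The values are toy data, and the sketch assumes the two groups share the same suffix after the `Earthquake : ` and `Total Effects : ` prefixes, as in the renamed columns shown in the tables above.

```python
import numpy as np
import pandas as pd

# Toy values; real columns contain many more rows and nulls.
df = pd.DataFrame({
    "Earthquake : Deaths": [1.0, 369.0, np.nan, 20.0],
    "Total Effects : Deaths": [1.0, 369.0, 5.0, 25.0],
    "Earthquake : Injuries": [5.0, 6000.0, 51.0, np.nan],
    "Total Effects : Injuries": [7.0, 6000.0, 51.0, 3.0],
})

pair_corr = {}
for col in df.columns:
    if col.startswith("Earthquake : "):
        total_col = col.replace("Earthquake : ", "Total Effects : ")
        # Series.corr (Pearson by default) uses only rows where both values exist.
        pair_corr[col] = df[col].corr(df[total_col])

print(pair_corr)
```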
Mean values by columns Earthquake : Deaths -> 3748.109714838086, Total Effects : Deaths -> 4228.737367802585 Earthquake : Missing -> 2182.7619047619046, Total Effects : Missing -> 1910.68 Earthquake : Injuries -> 2173.5844051446948, Total Effects : Injuries -> 2379.9952343129466 Earthquake : Damage (in M$) -> 1252.0898943248533, Total Effects : Damage (in M$) -> 1892.2905592105262 Earthquake : Houses Destroyed -> 17623.946564885497, Total Effects : Houses Destroyed -> 18191.71481028152 Earthquake : Houses Damaged -> 25133.18775510204, Total Effects : Houses Damaged -> 58828.35514018692
So here we can see that for almost all columns (except Missing, which has very few non-null values), total effects have larger mean values, which is expected.
Jaccard similarity for categorical columns Earthquake : Deaths Description, Total Effects : Deaths Description -> 0.9559331290243338 Earthquake : Missing Description, Total Effects : Missing Description -> 0.9035087719298246 Earthquake : Injuries Description, Total Effects : Injuries Description -> 0.9859108466575434 Earthquake : Damage Description, Total Effects : Damage Description -> 0.9573823255610486 Earthquake : Houses Destroyed Description, Total Effects : Houses Destroyed Description -> 0.9513708535574468 Earthquake : Houses Damaged Description, Total Effects : Houses Damaged Description -> 0.9309467061931035
This confirms that the categorical columns of these two groups are very similar.
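The exact Jaccard definition used for the figures above is not shown, so the sketch below is only one plausible reading: each column is treated as a set of (row label, category) pairs over its non-null entries. `jaccard_similarity` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def jaccard_similarity(a: pd.Series, b: pd.Series) -> float:
    """Jaccard index of the (row label, category) pairs in two columns.

    This definition is an assumption made for illustration.
    """
    set_a = set(a.dropna().items())
    set_b = set(b.dropna().items())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Toy columns: rows 0 and 1 agree, row 2 is missing on one side, row 3 differs.
eq = pd.Series(["Few", "Many", np.nan, "Some"])
total = pd.Series(["Few", "Many", "Few", "Many"])
score = jaccard_similarity(eq, total)
print(score)
```

With this definition a row counts against the score both when the categories differ and when only one of the two columns has a value.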
We will retain only one of these two groups for further analysis.
From the basic info category, the Flag Tsunami and EQ Primary columns will definitely stay in the dataset. Around 50% of the values for focal depth and intensity are null. Because we will later drop a lot of rows, we will decide later what to do with these columns.
For location columns we will drop Region code even though almost all of its values are present (there are only 18 distinct values, and the analysis will focus on countries). The State column will be retained only for analysis of earthquakes in the USA. Location name will be retained along with Country and the latitude and longitude data.
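The raw data stores position in a single `Coordinates` string such as "41.2, 19.3", while later outputs list separate `Latitude` and `Longitude` columns. A split along these lines is sketched below; converting the parts to floats is an assumption of this sketch, since the later `info()` output lists both columns as object dtype.

```python
import numpy as np
import pandas as pd

# Two coordinate strings in the format shown in the tables above, plus a missing one.
df = pd.DataFrame({"Coordinates": ["41.2, 19.3", "18.339, -98.68", np.nan]})

# Split "lat, lon" into two columns; missing coordinates stay missing.
parts = df["Coordinates"].str.split(",", expand=True)
df["Latitude"] = pd.to_numeric(parts[0].str.strip())
df["Longitude"] = pd.to_numeric(parts[1].str.strip())

print(df[["Latitude", "Longitude"]])
```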
We can see that the date fields are usually present, so these columns will be retained.
For magnitudes we will retain only EQ Primary column, because of reasons given above.
From the chart above we can see that more data is available for earthquake effects than for total effects. Because of that we will retain the earthquake effects columns for further analysis and drop the total effects columns.
Dataset rows: 6208, columns: 24
We can see that very few earthquakes were recorded before the 19th century. For a more consistent analysis we will therefore drop all earthquakes that happened before that period.
Again we can see a steady increase in recorded earthquakes over the decades from the 19th to the 21st century. Why is that happening? One certain reason is that, as technology develops, our recordings of the different earthquake factors become more accurate. Let's see which of the 5 criteria (caused deaths, more than 1 million dollars of damage, magnitude 7.5 or greater, intensity 10 or more, generated a tsunami) was most common and how that distribution changes over the decades.
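Counting the satisfied criteria per decade could be done roughly as below. The rows are made up, and two mappings are assumptions of this sketch: a non-null death count stands for "caused deaths", and a damage description starting with MODERATE, SEVERE or EXTREME stands for "about 1 million dollars or more".

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Year": [1805, 1906, 1906, 2004, 2011],
    "EQ Primary": [7.6, 7.9, 6.0, 9.1, 5.2],
    "Intensity": [np.nan, 11.0, 8.0, np.nan, 7.0],
    "Flag Tsunami": [False, True, False, True, False],
    "Earthquake : Deaths": [np.nan, 3000.0, np.nan, 1001.0, 2.0],
    "Earthquake : Damage Description": [
        np.nan,
        "EXTREME (~$25 million or more)",
        "MODERATE (~$1 to $5 million)",
        "EXTREME (~$25 million or more)",
        "LIMITED (roughly corresponding to less than $1 million)",
    ],
})

# One boolean column per criterion; comparisons with NaN evaluate to False.
conditions = pd.DataFrame({
    "Deaths": df["Earthquake : Deaths"].notna(),
    "Damage": df["Earthquake : Damage Description"].str.contains(
        r"^(?:MODERATE|SEVERE|EXTREME)", na=False
    ),
    "Magnitude": df["EQ Primary"] >= 7.5,
    "Intensity": df["Intensity"] >= 10,
    "Tsunami": df["Flag Tsunami"],
})

# Group by decade and count how many earthquakes satisfy each criterion.
decade = (df["Year"] // 10) * 10
per_decade = conditions.groupby(decade).sum()
print(per_decade)
```

An earthquake can satisfy several criteria at once, so the row sums of `per_decade` can exceed the number of earthquakes in that decade.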
[Figure: Number of earthquakes per decade grouped by satisfied conditions; x-axis: Decade]
The main takeaway here is that the number of earthquakes with enough damage or deaths to be classified as significant has increased a lot in more recent decades. Magnitude and intensity made more earthquakes significant in some past decades (with a peak around 1900), so they did not contribute to the trend of an increasing number of significant earthquakes. We can also observe that the number of earthquakes qualifying because of a tsunami has not changed much in most decades from 1850 until now.
With this we can conclude that more earthquakes are recorded in this dataset for the recent past because earthquakes now cause more damage and deaths (at least recorded ones). This can be due to more advanced technology that allows better tracking of material damage and deaths, but also due to more people and cities in earthquake-prone areas.
For a more consistent analysis, based on the previous findings, we will retain only earthquakes that happened from 1960 onward.
Now we will drop some columns that do not have enough values present. From earthquake effects we will drop the missing people, houses damaged, houses destroyed and damage in M$ columns. The descriptions of damage, injuries and deaths can stay in the dataset. State has few values because it is restricted to USA earthquakes only, so it will remain in the dataset. We will drop the Intensity column (Mercalli scale) because of its many null values, so EQ Primary will be our only measure of magnitude.
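A minimal sketch of the row filter and column drop follows. It uses toy rows and only two of the dropped columns; the cutoff is written as `>= 1960` because the summary statistics of the filtered data show 1960 as the minimum year.

```python
import pandas as pd

# Toy rows spanning the cutoff year.
df = pd.DataFrame({
    "Year": [1556, 1959, 1960, 2017],
    "EQ Primary": [8.0, 7.0, 9.5, 7.1],
    "Intensity": [11.0, None, None, 9.0],
    "Earthquake : Missing": [None, None, None, None],
})

# Keep earthquakes from 1960 onward, then drop the sparsely filled columns.
recent = df[df["Year"] >= 1960]
recent = recent.drop(columns=["Intensity", "Earthquake : Missing"])
print(recent)
```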
<class 'pandas.core.frame.DataFrame'> Int64Index: 2426 entries, 4216 to 10515 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Flag Tsunami 2426 non-null bool 1 Year 2426 non-null int64 2 Month 2426 non-null Int32 3 Day 2426 non-null Int32 4 Focal Depth 2352 non-null float64 5 EQ Primary 2395 non-null float64 6 Country 2426 non-null object 7 State 143 non-null object 8 Location name 2426 non-null object 9 Earthquake : Deaths 1053 non-null float64 10 Earthquake : Deaths Description 1077 non-null object 11 Earthquake : Injuries 1078 non-null float64 12 Earthquake : Injuries Description 1186 non-null object 13 Earthquake : Damage Description 1990 non-null object 14 Latitude 2425 non-null object 15 Longitude 2425 non-null object dtypes: Int32(2), bool(1), float64(4), int64(1), object(8) memory usage: 291.4+ KB
| | Year | Month | Day | Focal Depth | EQ Primary | Earthquake : Deaths | Earthquake : Injuries |
|---|---|---|---|---|---|---|---|
| count | 2426.000000 | 2426.0 | 2426.0 | 2352.000000 | 2395.000000 | 1053.000000 | 1078.000000 |
| mean | 1994.679720 | 6.470734 | 15.791838 | 33.242347 | 6.118706 | 1183.420703 | 2360.817254 |
| std | 17.489439 | 3.420401 | 8.752634 | 58.395766 | 1.033245 | 13177.577957 | 28174.601784 |
| min | 1960.000000 | 1.0 | 1.0 | 0.000000 | 1.600000 | 1.000000 | 1.000000 |
| 25% | 1980.000000 | 4.0 | 8.0 | 10.000000 | 5.400000 | 2.000000 | 9.000000 |
| 50% | 1999.000000 | 7.0 | 16.0 | 21.000000 | 6.100000 | 5.000000 | 36.000000 |
| 75% | 2009.000000 | 9.0 | 23.0 | 33.000000 | 6.900000 | 29.000000 | 200.000000 |
| max | 2020.000000 | 12.0 | 31.0 | 675.000000 | 9.500000 | 316000.000000 | 799000.000000 |
[Figure: Focal depth distribution; y-axis: Frequency]
Because focal depth can take some very large values, we will impute missing values with the median of this column instead of the mean.
Number of missing values for focal depth: 74 Imputing focal depth with median value: 21.0 Number of missing values for focal depth after imputing: 0
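The reason for preferring the median on a skewed column can be seen on a small example. The numbers below are toy values chosen to include one very deep event; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy focal depths in km with one large outlier and one missing value.
depth = pd.Series([10.0, 21.0, 33.0, 675.0, np.nan], name="Focal Depth")

median_depth = depth.median()  # middle of the observed values, robust to the outlier
mean_depth = depth.mean()      # pulled upward by the single deep event

# Fill the gap with the median.
filled = depth.fillna(median_depth)
print(median_depth, mean_depth, filled.isna().sum())
```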
[Figure: EQ Primary distribution; y-axis: Frequency]
We can impute missing values for EQ Primary with mean value of this column.
Number of missing values for eq primary: 31 Imputing eq primary with mean value: 6.118705636743215 Number of missing values for eq primary after imputing: 0
Now let's see what null values we have left
<class 'pandas.core.frame.DataFrame'> Int64Index: 2426 entries, 4216 to 10515 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Flag Tsunami 2426 non-null bool 1 Year 2426 non-null int64 2 Month 2426 non-null Int32 3 Day 2426 non-null Int32 4 Focal Depth 2426 non-null float64 5 EQ Primary 2426 non-null float64 6 Country 2426 non-null object 7 State 143 non-null object 8 Location name 2426 non-null object 9 Earthquake : Deaths 1053 non-null float64 10 Earthquake : Deaths Description 1077 non-null object 11 Earthquake : Injuries 1078 non-null float64 12 Earthquake : Injuries Description 1186 non-null object 13 Earthquake : Damage Description 1990 non-null object 14 Latitude 2425 non-null object 15 Longitude 2425 non-null object dtypes: Int32(2), bool(1), float64(4), int64(1), object(8) memory usage: 291.4+ KB
We can see that there is still one earthquake that does not have latitude and longitude. Let's see what that earthquake is and decide what to do with it.
| ID Earthquake | Flag Tsunami | Year | Month | Day | Focal Depth | EQ Primary | Country | State | Location name | Earthquake : Deaths | Earthquake : Deaths Description | Earthquake : Injuries | Earthquake : Injuries Description | Earthquake : Damage Description | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7775 | True | 1978 | 6 | 22 | 21.0 | 6.118706 | ITALY | NaN | ITALY: ADRIATIC SEA | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
We can see that there is very little data about it: its focal depth (21.0) and magnitude (6.12) are just the values imputed above. We will therefore drop this row.
<class 'pandas.core.frame.DataFrame'> Int64Index: 2425 entries, 4216 to 10515 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Flag Tsunami 2425 non-null bool 1 Year 2425 non-null int64 2 Month 2425 non-null Int32 3 Day 2425 non-null Int32 4 Focal Depth 2425 non-null float64 5 EQ Primary 2425 non-null float64 6 Country 2425 non-null object 7 State 143 non-null object 8 Location name 2425 non-null object 9 Earthquake : Deaths 1053 non-null float64 10 Earthquake : Deaths Description 1077 non-null object 11 Earthquake : Injuries 1078 non-null float64 12 Earthquake : Injuries Description 1186 non-null object 13 Earthquake : Damage Description 1990 non-null object 14 Latitude 2425 non-null object 15 Longitude 2425 non-null object dtypes: Int32(2), bool(1), float64(4), int64(1), object(8) memory usage: 291.3+ KB
We will not impute null values for the number of deaths and injuries for now, because too many values are missing. Maybe later we will be able to provide more context for these values.
In the previous section we already saw some analysis of earthquake occurrences by decade and century, but that was done on larger parts of the dataset and for the purpose of data selection.
First we will convert year, month and day into one datetime column, because that makes some later steps easier to implement.
| ID Earthquake | Year | Month | Day | Date |
|---|---|---|---|---|
| 4216 | 1960 | 2 | 29 | 1960-02-29 |
| 4221 | 1960 | 4 | 29 | 1960-04-29 |
| 4257 | 1962 | 2 | 14 | 1962-02-14 |
| 4293 | 1963 | 5 | 19 | 1963-05-19 |
| 4313 | 1964 | 4 | 2 | 1964-04-02 |
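A date column like the one in the table above can be assembled with `pd.to_datetime`, which accepts a frame whose columns are named year, month and day. The sketch reuses the first three rows of the table and assumes no missing parts remain at this stage.

```python
import pandas as pd

# First three rows of the table above.
df = pd.DataFrame({"Year": [1960, 1960, 1962], "Month": [2, 4, 2], "Day": [29, 29, 14]})

# Lower-case the column names so they match the names to_datetime expects.
parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
df["Date"] = pd.to_datetime(parts)
print(df)
```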
[Figure: Number of earthquakes per year; x-axis: Year]
We can see that starting from around the year 2000 there is a large increase in the number of significant earthquakes. In 2020 there is a big decrease, so we will now investigate the reason for that. It is possible that we do not have complete data for the last year.
[Figure: Number of earthquakes per month in 2020; x-axis: Date]
As we assumed, there is no data after August 2020. We need to keep this in mind and maybe drop this year in some of the later analysis.
<AxesSubplot: title={'center': 'Number of earthquakes per month in 1960-2019'}, xlabel='Date'>
As noticed previously on the whole dataset, there are no big differences in the number of earthquakes between months, so they are spread evenly across the year.
<AxesSubplot: title={'center': 'Number of earthquakes per days in a week from 1960-2020'}, xlabel='Date'>
As with the statistics by month, there is no big difference in the number of earthquakes between days of the week.
Let's see how the magnitudes of earthquakes changed over the years.
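The counts behind the yearly, monthly and weekday charts can be sketched with the `.dt` accessor. This is an illustration on made-up dates, not the notebook's own code; `df` and the variable names are assumptions.

```python
import pandas as pd

# Four made-up event dates; the real frame would hold the `Date` column.
dates = pd.to_datetime(["2019-01-07", "2019-01-08", "2020-03-02", "2020-08-10"])
df = pd.DataFrame({"Date": dates})

# Number of events per calendar year, per month and per day of the week.
per_year = df.groupby(df["Date"].dt.year).size()
per_month = df.groupby(df["Date"].dt.month).size()
per_weekday = df.groupby(df["Date"].dt.day_name()).size()

# Last month that has any data in 2020, a quick completeness check.
last_month_2020 = df.loc[df["Date"].dt.year == 2020, "Date"].dt.month.max()

print(per_year, per_month, per_weekday, last_month_2020, sep="\n")
```

Checking the last month with data is a cheap way to detect a truncated final year before comparing yearly totals.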
<AxesSubplot: title={'center': 'Magnitude statistics by year'}, xlabel='Year'>
For the maximum magnitude we can notice larger fluctuations in the early years, but the values always stayed within the [7.5, 9.5] interval, so there is no obvious trend here, for example towards stronger earthquakes in recent years.
The average magnitude is slightly lower in recent years. That is probably because more earthquakes now enter the dataset as significant through the damage and deaths criteria, so some earthquakes with lower magnitude are also included and they slightly decrease the mean value. This also explains the even more obvious trend in the minimum values.
The main conclusion here is that earthquakes have not become stronger recently compared to past decades, but more earthquakes are categorized as significant.
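A minimal sketch of how per-year magnitude statistics can be computed, and of why including more low-magnitude events lowers the mean and minimum without touching the maximum. The numbers are invented for illustration; only the column names `Year` and `EQ Primary` come from the dataset.

```python
import pandas as pd

# Invented data: the later year includes extra low-magnitude events.
df = pd.DataFrame(
    {
        "Year": [1960, 1960, 2010, 2010, 2010],
        "EQ Primary": [7.0, 8.0, 5.0, 6.0, 8.0],
    }
)

# Minimum, mean and maximum magnitude for each year.
stats = df.groupby("Year")["EQ Primary"].agg(["min", "mean", "max"])
print(stats)
```

In this toy frame both years share the same maximum, while the mean and minimum are lower in the year with the additional small events.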
<AxesSubplot: title={'center': 'Focal depth statistics by year'}, xlabel='Year'>
We can see that the minimums and means stayed the same over the years. The maximum values change a lot, but we cannot say that earthquakes are now generated at deeper or shallower focal points than in the past.
<AxesSubplot: title={'center': 'Number of tsunamis per year'}, xlabel='Year'>
<AxesSubplot: title={'center': 'Percentage of tsunamis per year'}, xlabel='Year'>
From these two charts we can see that the number of tsunamis fluctuated a lot over the years (from 2 to 18), but there is no obvious trend towards more or fewer tsunamis in recent years. We can also notice that a smaller percentage of earthquakes are tsunamis in recent years, which is again due to the larger number of regular earthquakes (those that did not cause tsunamis).
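Both charts can be derived from the boolean `Flag Tsunami` column in one pass, since the sum of a boolean column is a count and its mean is a share. The sketch below uses invented rows and hypothetical variable names.

```python
import pandas as pd

# Invented rows: one tsunami in each year, but more events in 2005.
df = pd.DataFrame(
    {
        "Year": [1970, 1970, 2005, 2005, 2005, 2005],
        "Flag Tsunami": [True, False, True, False, False, False],
    }
)

# Sum of booleans = number of tsunamis, mean of booleans = fraction of tsunamis.
by_year = df.groupby("Year")["Flag Tsunami"].agg(count="sum", share="mean")
by_year["share"] = by_year["share"] * 100  # convert fraction to percent

print(by_year)
```

The toy result shows the same effect described above: the tsunami count is unchanged, while the percentage drops because the denominator grew.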
| plate | lat | lon | |
|---|---|---|---|
| 0 | am | 30.754 | 132.824 |
| 1 | am | 30.970 | 132.965 |
| 2 | am | 31.216 | 133.197 |
| 3 | am | 31.515 | 133.500 |
| 4 | am | 31.882 | 134.042 |
Earthquake occurrence by time
We will analyse earthquakes in Turkey and Serbia (and the Balkan region).
Number of earthquakes in Turkey: 100
Number of earthquakes per year in Turkey: 1.639344262295082
| Flag Tsunami | Year | Month | Day | Focal Depth | EQ Primary | Country | State | Location name | Earthquake : Deaths | Earthquake : Deaths Description | Earthquake : Injuries | Earthquake : Injuries Description | Earthquake : Damage Description | Latitude | Longitude | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID Earthquake | |||||||||||||||||
| 5795 | False | 2004 | 8 | 4 | 10.0 | 5.6 | TURKEY | NaN | TURKEY: BODRUM | NaN | NaN | 15.0 | Few (~1 to 50 deaths) | NaN | 36.833 | 27.815 | 2004-08-04 |
| 4488 | False | 1969 | 4 | 30 | 9.0 | 5.1 | TURKEY | NaN | TURKEY: DEMIRCI, WESTERN ANATOLIA, ISTANBUL | NaN | NaN | NaN | NaN | MODERATE (~$1 to $5 million) | 39.200 | 28.600 | 1969-04-30 |
| 5547 | False | 1999 | 12 | 3 | 19.0 | 5.7 | TURKEY | NaN | TURKEY: GORESKEN, ERZURUM PROVINCE | 1.0 | Few (~1 to 50 deaths) | 6.0 | Few (~1 to 50 deaths) | MODERATE (~$1 to $5 million) | 40.358 | 42.346 | 1999-12-03 |
| 5767 | False | 2004 | 3 | 25 | 21.0 | 5.6 | TURKEY | NaN | TURKEY: ERZURUM | 10.0 | Few (~1 to 50 deaths) | 46.0 | Few (~1 to 50 deaths) | LIMITED (roughly corresponding to less than $1... | 39.930 | 40.812 | 2004-03-25 |
| 9833 | False | 2011 | 5 | 19 | 7.0 | 4.3 | TURKEY | NaN | TURKEY: SIMAV | 2.0 | Few (~1 to 50 deaths) | 125.0 | Many (~101 to 1000 deaths) | LIMITED (roughly corresponding to less than $1... | 39.120 | 29.040 | 2011-05-19 |
So in Turkey earthquakes happen in both an eastern and a western region. Let's see where Turkey ranks in some statistics.
Turkey is in 37th place in the world (out of 127 countries) by maximum earthquake magnitude.
Turkey is in 6th place in the world (out of 127 countries) by number of earthquakes.
Turkey is in 7th place in the world (out of 127 countries) by number of deaths.
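Rankings like these can be obtained by aggregating per country and then ranking each aggregate in descending order. The sketch below is an assumption about how such a ranking might be computed, using three made-up countries; it is not the notebook's own code.

```python
import pandas as pd

# Made-up data for three hypothetical countries A, B and C.
df = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "C", "C"],
        "EQ Primary": [6.0, 6.5, 7.0, 9.0, 8.0, 5.0],
        "Earthquake : Deaths": [10, None, 5, 1, None, 2],
    }
)

# One row per country: strongest event, number of events, total deaths.
per_country = df.groupby("Country").agg(
    max_magnitude=("EQ Primary", "max"),
    n_quakes=("EQ Primary", "count"),
    deaths=("Earthquake : Deaths", "sum"),  # NaN values are skipped
)

# Rank 1 = largest value; ties share the best (minimum) rank.
ranks = per_country.rank(ascending=False, method="min").astype(int)
print(ranks)
```

Country A mirrors the pattern seen for Turkey: it ranks last by maximum magnitude but first by number of earthquakes and by deaths.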
So Turkey did not have earthquakes with very large magnitudes (37th in the world), but it had many deaths and a lot of damage caused by earthquakes. This suggests that the regions where earthquakes happen probably have a high population density and that the infrastructure is not good enough to withstand earthquakes.
Now we will consider how expected the earthquake that happened in 2023 was. Some relevant statistics are:
Maximum magnitude of earthquake in Turkey (1960 - 2020): 7.6
Maximum magnitude of earthquake in Turkey in whole history: 7.6
Text(0.5, 1.0, 'Distribution of magnitudes in Turkey')
From these statistics we can see that the earthquake that happened in Turkey in 2023 is the strongest in Turkey's history according to this dataset. Let's explore its location in more detail.
Although this earthquake occurred near the edge of a tectonic plate, it was extremely strong compared with past earthquakes in that region. On the map of death counts, the eastern region of the country had the most extreme cases, but this earthquake was still further south than expected.
Maximum number of deaths in Turkey (1960 - 2020): 17118.0
This number, compared with the more than 50000 deaths in 2023, is also unexpected, but the 2023 event was the strongest earthquake in Turkey's history, in an unexpected place and near several cities, which can explain the difference. Turkey has also had a dozen earthquakes in the extreme category (for deaths) in the past.
Number of earthquakes in Serbia: 8
Number of earthquakes per year in Serbia: 0.13114754098360656
| Flag Tsunami | Year | Month | Day | Focal Depth | EQ Primary | Country | State | Location name | Earthquake : Deaths | Earthquake : Deaths Description | Earthquake : Injuries | Earthquake : Injuries Description | Earthquake : Damage Description | Latitude | Longitude | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID Earthquake | |||||||||||||||||
| 4989 | False | 1983 | 9 | 10 | 10.0 | 5.1 | SERBIA | NaN | BALKANS NW: SERBIA | NaN | NaN | NaN | NaN | MODERATE (~$1 to $5 million) | 43.246 | 20.859 | 1983-09-10 |
| 5044 | False | 1984 | 9 | 7 | 13.0 | 4.7 | SERBIA | NaN | BALKANS NW: SERBIA | NaN | NaN | 2.0 | Few (~1 to 50 deaths) | MODERATE (~$1 to $5 million) | 43.314 | 20.957 | 1984-09-07 |
| 5505 | False | 1998 | 9 | 29 | 10.0 | 5.5 | SERBIA | NaN | BALKANS NW: SERBIA: BELGRADE, LJIG, VALJEVO | 1.0 | Few (~1 to 50 deaths) | 17.0 | Few (~1 to 50 deaths) | MODERATE (~$1 to $5 million) | 44.209 | 20.080 | 1998-09-29 |
| 10137 | False | 2015 | 3 | 8 | 20.0 | 4.4 | SERBIA | NaN | BALKANS NW: SERBIA: KOSJERIC | NaN | NaN | NaN | NaN | LIMITED (roughly corresponding to less than $1... | 44.088 | 19.861 | 2015-03-08 |
| 5632 | False | 2002 | 4 | 24 | 10.0 | 5.7 | SERBIA | NaN | BALKANS NW: KOSOVO; MACEDONIA: N | 1.0 | Few (~1 to 50 deaths) | 60.0 | Some (~51 to 100 deaths) | LIMITED (roughly corresponding to less than $1... | 42.436 | 21.466 | 2002-04-24 |
| 4800 | False | 1978 | 4 | 13 | 33.0 | 5.7 | SERBIA | NaN | BALKANS NW: SERBIA: BRUS | NaN | NaN | NaN | NaN | MODERATE (~$1 to $5 million) | 43.269 | 20.919 | 1978-04-13 |
| 4877 | False | 1980 | 5 | 18 | 9.0 | 5.8 | SERBIA | NaN | BALKANS NW: SERBIA | NaN | NaN | 30.0 | Few (~1 to 50 deaths) | MODERATE (~$1 to $5 million) | 43.294 | 20.837 | 1980-05-18 |
| 9632 | False | 2010 | 11 | 3 | 1.0 | 5.5 | SERBIA | NaN | BALKANS NW: SERBIA: KRALJEVO | 2.0 | Few (~1 to 50 deaths) | 100.0 | Some (~51 to 100 deaths) | SEVERE (~>$5 to $24 million) | 43.760 | 20.673 | 2010-11-03 |
So from this we can conclude that Serbia has one significant earthquake roughly every seven to eight years (8 earthquakes in 61 years). Let's see them on a map.
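The printed rate of about 0.131 earthquakes per year is consistent with dividing 8 events by the 61 calendar years from 1960 to 2020 inclusive. That denominator is inferred from the printed value, not taken from the notebook's code. The short calculation below converts the rate into an average interval between events.

```python
# Values taken from the printed output; the 61-year span is an inference.
n_quakes = 8
n_years = 2020 - 1960 + 1  # 61 calendar years, both ends included

rate_per_year = n_quakes / n_years       # about 0.131 events per year
years_per_quake = n_years / n_quakes     # average interval between events

print(f"{rate_per_year:.3f} per year, one every {years_per_quake:.1f} years")
```

The same arithmetic applied to Turkey (100 events over 61 years) gives the printed rate of about 1.64 earthquakes per year.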
Map of earthquakes in Serbia with magnitude
Map of earthquakes in Serbia by damage
On these maps we can see that the 2010 Kraljevo earthquake caused the most damage. There is also an active region around Valjevo and one on Kopaonik (which did not cause big material damage).
It is also worth noting that the region around Skopje, near the border, can have very strong earthquakes, so those earthquakes can also affect the southern regions of the country.
Let's see where Serbia ranks worldwide for various parameters.
Serbia is in 99th place in the world (out of 127 countries) by maximum earthquake magnitude.
Serbia is in 46th place in the world (out of 127 countries) by number of earthquakes.
Serbia is in 75th place in the world (out of 127 countries) by number of deaths.
So Serbia has had a fair number of significant earthquakes (46th), but they did not have very large magnitudes (99th), and it belongs to the countries with an average number of deaths and amount of material damage caused by earthquakes.
LIMITED (roughly corresponding to less than $1 million)    728
MODERATE (~$1 to $5 million)                               698
Unknown                                                    435
SEVERE (~>$5 to $24 million)                               301
EXTREME (~$25 million or more)                             263
Name: Earthquake : Damage Description, dtype: int64
<AxesSubplot: xlabel='Earthquake : Damage Description', ylabel='EQ Primary'>
<AxesSubplot: xlabel='Earthquake : Damage Description', ylabel='Focal Depth'>
Text(0, 0.5, 'Number of earthquakes')
From this we can see that for most tsunamis we do not have data about damage. Because of that, it makes sense to view these two groups separately when analysing damage.
<AxesSubplot: title={'center': 'Damage statistics by year'}>
As we can see from the chart, from 2000 to 2010 there was a huge increase in the number of earthquakes with limited damage. That still does not tell us a lot, because damage may simply have been recorded better in recent years. We are more interested in the severe and extreme damage categories, and there we cannot see an obvious trend compared with past decades: the number of earthquakes with really big damage stayed more or less the same over the years.
<AxesSubplot: title={'center': 'Number of extreme damage earthquakes per country (15 countries with greatest number)'}>
China has a big lead here (as it does for the number of earthquakes). The USA was 5th by number of earthquakes but is second here, and Japan also ranks higher here than by number of earthquakes. Italy shows a really big difference: it was 14th by number of earthquakes but is 4th here. We can conclude that more developed countries suffer bigger material damage from the same number of earthquakes, which could be due to many factors. It is probably due to more expensive infrastructure, but also to more people living in big cities.
Number of null values for numerical column: 1372
Number of null values for categorical column: 1348
Because both columns have nearly the same number of values, it is better to focus on the numerical data in further analysis. We will also add a No deaths category to the description where there is no data about deaths (this is just an assumption for now).
Properties of number of deaths
count      1053.000000
mean       1183.420703
std       13177.577957
min           1.000000
25%           2.000000
50%           5.000000
75%          29.000000
max      316000.000000
Name: Earthquake : Deaths, dtype: float64
So we can see that the smallest value here is 1. Because this dataset certainly contains some earthquakes without any deaths (they were classified as significant because of other factors), there are probably many earthquakes without casualties among the null values. The hypothesis is that we can treat all missing values as earthquakes without deaths, but we need to test this further.
Besides that, we can see a huge standard deviation, which tells us that there is a lot of variation in the number of deaths caused by earthquakes. There are a few very lethal earthquakes, but also many with no casualties at all.
Number of earthquakes per category
No deaths                           1348
Few (~1 to 50 deaths)                853
Many (~101 to 1000 deaths)            94
Some (~51 to 100 deaths)              69
Very Many (~1001 or more deaths)      60
None                                   1
Name: Earthquake : Deaths Description, dtype: int64
We have one None value here. We will set its value to No deaths.
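Both steps (labelling missing descriptions and folding the stray None entry into the same category) can be sketched as follows. The sketch assumes the stray entry is the literal string "None", which cannot be confirmed from the output alone, and it uses a four-row toy frame.

```python
import numpy as np
import pandas as pd

col = "Earthquake : Deaths Description"

# Toy column: one real category, two missing values and one literal "None"
# string (assumption: the stray entry is a string, not a Python None).
df = pd.DataFrame({col: ["Few (~1 to 50 deaths)", np.nan, "None", np.nan]})

# Missing values and the stray "None" label both become "No deaths".
df[col] = df[col].fillna("No deaths").replace("None", "No deaths")

counts = df[col].value_counts()
print(counts)
```

After this step, a `value_counts()` call on the column no longer shows a separate None row.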
Earthquakes with most death counts (first 10)
| Flag Tsunami | Year | Month | Day | Focal Depth | EQ Primary | Country | State | Location name | Earthquake : Deaths | Earthquake : Deaths Description | Earthquake : Injuries | Earthquake : Injuries Description | Earthquake : Damage Description | Latitude | Longitude | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID Earthquake | |||||||||||||||||
| 8732 | True | 2010 | 1 | 12 | 13.0 | 7.0 | HAITI | NaN | HAITI: PORT-AU-PRINCE | 316000.0 | Very Many (~1001 or more deaths) | 30000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 18.457 | -72.533 | 2010-01-12 |
| 4735 | False | 1976 | 7 | 27 | 23.0 | 7.5 | CHINA | NaN | CHINA: NE: TANGSHAN | 242769.0 | Very Many (~1001 or more deaths) | 799000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 39.570 | 117.980 | 1976-07-27 |
| 7843 | True | 2008 | 5 | 12 | 19.0 | 7.9 | CHINA | NaN | CHINA: SICHUAN PROVINCE | 87652.0 | Very Many (~1001 or more deaths) | 374171.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 31.002 | 103.322 | 2008-05-12 |
| 6778 | False | 2005 | 10 | 8 | 26.0 | 7.6 | PAKISTAN | NaN | PAKISTAN: MUZAFFARABAD, URI, ANANTNAG, BARAMULA | 76213.0 | Very Many (~1001 or more deaths) | 146599.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 34.539 | 73.588 | 2005-10-08 |
| 4531 | True | 1970 | 5 | 31 | 43.0 | 7.9 | PERU | NaN | PERU: NORTHERN, PISCO, CHICLAYO | 66794.0 | Very Many (~1001 or more deaths) | 50000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | -9.200 | -78.800 | 1970-05-31 |
| 5248 | True | 1990 | 6 | 20 | 19.0 | 7.3 | IRAN | NaN | IRAN: RASHT, QAZVIN, ZANJAN, RUDBAR, MANJIL | 40000.0 | Very Many (~1001 or more deaths) | 105000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 36.957 | 49.409 | 1990-06-20 |
| 5751 | False | 2003 | 12 | 26 | 10.0 | 6.6 | IRAN | NaN | IRAN: SOUTHEASTERN: BAM, BARAVAT | 31000.0 | Very Many (~1001 or more deaths) | 30000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 28.995 | 58.311 | 2003-12-26 |
| 4600 | False | 1972 | 4 | 10 | 11.0 | 6.9 | IRAN | NaN | IRAN: QIR,KARZIN, JAHROM, FIRUZABAD | 30000.0 | Very Many (~1001 or more deaths) | 1700.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 28.400 | 52.800 | 1972-04-10 |
| 5184 | False | 1988 | 12 | 7 | 5.0 | 6.8 | ARMENIA | NaN | ARMENIA: LENINAKAN, SPITAK, KIROVAKAN | 25000.0 | Very Many (~1001 or more deaths) | NaN | NaN | EXTREME (~$25 million or more) | 40.987 | 44.185 | 1988-12-07 |
| 4711 | True | 1976 | 2 | 4 | 5.0 | 7.5 | GUATEMALA | NaN | GUATEMALA: CHIMALTENANGO, GUATEMALA CITY | 23000.0 | Very Many (~1001 or more deaths) | 76000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 15.324 | -89.101 | 1976-02-04 |
<AxesSubplot: title={'center': 'Distribution of number of deaths'}, ylabel='Frequency'>
Here we can spot some outliers (the most devastating earthquakes), while most earthquakes fall into the range of 0 to 3000 deaths. Let's create the same plot with limited intervals to better see the distribution for smaller values.
No deaths                           1349
Few (~1 to 50 deaths)                853
Many (~101 to 1000 deaths)            94
Some (~51 to 100 deaths)              69
Very Many (~1001 or more deaths)      60
Name: Earthquake : Deaths Description, dtype: int64
The situation is similar to before: there are very many earthquakes with a very small number of deaths, and the same shape is preserved for bigger values.
This is a chart with extremely big peaks. The years with really big numbers are due to one or a few very devastating earthquakes in that year.
Text(0.5, 1.0, 'Relationship between number of deaths and magnitude')
Here we can see that the categories with more deaths required earthquakes of higher magnitude. On the other hand, there are also many earthquakes with big magnitudes that did not cause deaths (isolated or unpopulated areas, great infrastructure, etc.).
Text(0.5, 1.0, 'Relationship between number of deaths and magnitude - regression')
With this we can confirm that the correlation between number of deaths and magnitude is positive, but not very strong.
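A regression plot shows the direction of the relationship; a correlation coefficient puts a number on its strength. The sketch below computes Pearson and Spearman coefficients on invented values, so the resulting numbers say nothing about the real dataset. Spearman is included as a suggestion because death counts are heavily skewed.

```python
import pandas as pd

# Invented values: deaths tend to grow with magnitude, but with big scatter.
df = pd.DataFrame(
    {
        "EQ Primary": [5.0, 5.5, 6.0, 6.5, 7.0, 7.5],
        "Earthquake : Deaths": [2, 0, 30, 1, 500, 20],
    }
)

# Pearson measures linear association; Spearman uses ranks and is less
# sensitive to a few extreme death counts.
pearson = df["EQ Primary"].corr(df["Earthquake : Deaths"])
spearman = df["EQ Primary"].corr(df["Earthquake : Deaths"], method="spearman")

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A positive coefficient well below 1 corresponds to the "positive, but not very strong" reading of the plot.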
Here we can see where the earthquakes with the most deaths lie in this space compared to the category with null values. They usually do not occupy the same space, which supports our hypothesis that we can treat null values as earthquakes with no deaths. The null values are much closer to the earthquakes with few deaths, and we can also spot 2 regions on the chart with a higher rate of earthquakes with a big number of deaths. We can also see a few somewhat isolated earthquakes with a large number of deaths and very high magnitude. Based on this analysis we will treat null values for number of deaths as zeros.
<AxesSubplot: title={'center': 'Number of deaths per country (15 countries with greatest number)'}, xlabel='Country'>
It is very interesting to compare this chart with the one for earthquakes with extreme damage. We saw there that developed countries (such as the USA, Italy and Japan) had a lot of damage, but here we can see that they have far fewer deaths. That is probably due to better infrastructure and better preparedness of people for earthquakes.
Number of null values for numerical column: 1347
Number of null values for categorical column: 1239
These counts of available values closely resemble those for number of deaths. As for deaths, we will assume that null values are earthquakes with no injuries (and try to verify that later).
Properties of number of injuries
count      1078.000000
mean       2360.817254
std       28174.601784
min           1.000000
25%           9.000000
50%          36.000000
75%         200.000000
max      799000.000000
Name: Earthquake : Injuries, dtype: float64
Here we have the same observations as for deaths, but the values are generally greater.
Number of earthquakes per category
No injuries                         1239
Few (~1 to 50 deaths)                642
Many (~101 to 1000 deaths)           274
Some (~51 to 100 deaths)             158
Very Many (~1001 or more deaths)     112
Name: Earthquake : Injuries Description, dtype: int64
Earthquakes with most injuries counts (first 10)
| Flag Tsunami | Year | Month | Day | Focal Depth | EQ Primary | Country | State | Location name | Earthquake : Deaths | Earthquake : Deaths Description | Earthquake : Injuries | Earthquake : Injuries Description | Earthquake : Damage Description | Latitude | Longitude | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID Earthquake | |||||||||||||||||
| 4735 | False | 1976 | 7 | 27 | 23.0 | 7.5 | CHINA | NaN | CHINA: NE: TANGSHAN | 242769.0 | Very Many (~1001 or more deaths) | 799000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 39.570 | 117.980 | 1976-07-27 |
| 7843 | True | 2008 | 5 | 12 | 19.0 | 7.9 | CHINA | NaN | CHINA: SICHUAN PROVINCE | 87652.0 | Very Many (~1001 or more deaths) | 374171.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 31.002 | 103.322 | 2008-05-12 |
| 5589 | False | 2001 | 1 | 26 | 16.0 | 7.7 | INDIA | NaN | INDIA: GUJARAT: BHUJ, AHMADABAD, RAJOKOT; PA... | 20005.0 | Very Many (~1001 or more deaths) | 166836.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 23.419 | 70.232 | 2001-01-26 |
| 6778 | False | 2005 | 10 | 8 | 26.0 | 7.6 | PAKISTAN | NaN | PAKISTAN: MUZAFFARABAD, URI, ANANTNAG, BARAMULA | 76213.0 | Very Many (~1001 or more deaths) | 146599.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 34.539 | 73.588 | 2005-10-08 |
| 5248 | True | 1990 | 6 | 20 | 19.0 | 7.3 | IRAN | NaN | IRAN: RASHT, QAZVIN, ZANJAN, RUDBAR, MANJIL | 40000.0 | Very Many (~1001 or more deaths) | 105000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 36.957 | 49.409 | 1990-06-20 |
| 4711 | True | 1976 | 2 | 4 | 5.0 | 7.5 | GUATEMALA | NaN | GUATEMALA: CHIMALTENANGO, GUATEMALA CITY | 23000.0 | Very Many (~1001 or more deaths) | 76000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 15.324 | -89.101 | 1976-02-04 |
| 4531 | True | 1970 | 5 | 31 | 43.0 | 7.9 | PERU | NaN | PERU: NORTHERN, PISCO, CHICLAYO | 66794.0 | Very Many (~1001 or more deaths) | 50000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | -9.200 | -78.800 | 1970-05-31 |
| 5527 | True | 1999 | 8 | 17 | 13.0 | 7.6 | TURKEY | NaN | TURKEY: ISTANBUL, KOCAELI, SAKARYA | 17118.0 | Very Many (~1001 or more deaths) | 50000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 40.760 | 29.970 | 1999-08-17 |
| 7245 | False | 2006 | 5 | 26 | 13.0 | 6.3 | INDONESIA | NaN | INDONESIA: JAVA: BANTUL, YOGYAKARTA | 5749.0 | Very Many (~1001 or more deaths) | 38568.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | -7.961 | 110.446 | 2006-05-26 |
| 5399 | True | 1995 | 1 | 16 | 22.0 | 6.9 | JAPAN | NaN | JAPAN: SW HONSHU: KOBE, AWAJI-SHIMA, NISHINO... | 5502.0 | Very Many (~1001 or more deaths) | 36896.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 34.583 | 135.018 | 1995-01-16 |
We cannot see the Haiti earthquake here, because it has only 30000 injured people recorded. This could easily be a mistake in the dataset, because some other sources reported around 10x more injured people in that earthquake. We will correct that now.
Everything else makes sense compared to deaths.
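A single-cell correction like this is usually done with `.loc`, addressing the row by its `ID Earthquake` index label. The sketch below uses a two-row toy frame. The ID 8732 and the corrected value 300000 are the ones visible in the tables of this section, and the frame name `df` is an assumption.

```python
import pandas as pd

col = "Earthquake : Injuries"

# Two rows from the tables in this section, indexed by earthquake ID.
df = pd.DataFrame(
    {"Country": ["HAITI", "CHINA"], col: [30000.0, 799000.0]},
    index=pd.Index([8732, 4735], name="ID Earthquake"),
)

# Overwrite the suspicious Haiti value in place (row label, column label).
df.loc[8732, col] = 300000.0

# Re-check the ranking after the correction.
top = df.sort_values(col, ascending=False)
print(top)
```

Addressing the row by ID rather than by position keeps the fix valid even if the frame is later re-sorted or filtered.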
Earthquakes with most injuries counts (first 10)
| Flag Tsunami | Year | Month | Day | Focal Depth | EQ Primary | Country | State | Location name | Earthquake : Deaths | Earthquake : Deaths Description | Earthquake : Injuries | Earthquake : Injuries Description | Earthquake : Damage Description | Latitude | Longitude | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID Earthquake | |||||||||||||||||
| 4735 | False | 1976 | 7 | 27 | 23.0 | 7.5 | CHINA | NaN | CHINA: NE: TANGSHAN | 242769.0 | Very Many (~1001 or more deaths) | 799000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 39.570 | 117.980 | 1976-07-27 |
| 7843 | True | 2008 | 5 | 12 | 19.0 | 7.9 | CHINA | NaN | CHINA: SICHUAN PROVINCE | 87652.0 | Very Many (~1001 or more deaths) | 374171.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 31.002 | 103.322 | 2008-05-12 |
| 8732 | True | 2010 | 1 | 12 | 13.0 | 7.0 | HAITI | NaN | HAITI: PORT-AU-PRINCE | 316000.0 | Very Many (~1001 or more deaths) | 300000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 18.457 | -72.533 | 2010-01-12 |
| 5589 | False | 2001 | 1 | 26 | 16.0 | 7.7 | INDIA | NaN | INDIA: GUJARAT: BHUJ, AHMADABAD, RAJOKOT; PA... | 20005.0 | Very Many (~1001 or more deaths) | 166836.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 23.419 | 70.232 | 2001-01-26 |
| 6778 | False | 2005 | 10 | 8 | 26.0 | 7.6 | PAKISTAN | NaN | PAKISTAN: MUZAFFARABAD, URI, ANANTNAG, BARAMULA | 76213.0 | Very Many (~1001 or more deaths) | 146599.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 34.539 | 73.588 | 2005-10-08 |
| 5248 | True | 1990 | 6 | 20 | 19.0 | 7.3 | IRAN | NaN | IRAN: RASHT, QAZVIN, ZANJAN, RUDBAR, MANJIL | 40000.0 | Very Many (~1001 or more deaths) | 105000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 36.957 | 49.409 | 1990-06-20 |
| 4711 | True | 1976 | 2 | 4 | 5.0 | 7.5 | GUATEMALA | NaN | GUATEMALA: CHIMALTENANGO, GUATEMALA CITY | 23000.0 | Very Many (~1001 or more deaths) | 76000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 15.324 | -89.101 | 1976-02-04 |
| 4531 | True | 1970 | 5 | 31 | 43.0 | 7.9 | PERU | NaN | PERU: NORTHERN, PISCO, CHICLAYO | 66794.0 | Very Many (~1001 or more deaths) | 50000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | -9.200 | -78.800 | 1970-05-31 |
| 5527 | True | 1999 | 8 | 17 | 13.0 | 7.6 | TURKEY | NaN | TURKEY: ISTANBUL, KOCAELI, SAKARYA | 17118.0 | Very Many (~1001 or more deaths) | 50000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 40.760 | 29.970 | 1999-08-17 |
| 7245 | False | 2006 | 5 | 26 | 13.0 | 6.3 | INDONESIA | NaN | INDONESIA: JAVA: BANTUL, YOGYAKARTA | 5749.0 | Very Many (~1001 or more deaths) | 38568.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | -7.961 | 110.446 | 2006-05-26 |
<AxesSubplot: title={'center': 'Distribution of number of injuries'}, ylabel='Frequency'>
No injuries                         1239
Few (~1 to 50 deaths)                642
Many (~101 to 1000 deaths)           274
Some (~51 to 100 deaths)             158
Very Many (~1001 or more deaths)     112
Name: Earthquake : Injuries Description, dtype: int64
All these statistics are very similar to those for deaths. Also notice that the category labels here say deaths instead of injuries, so we will correct that.
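One simple way to fix the labels is a literal string replacement on the description column. This is a sketch of that idea on three sample labels, not necessarily how the notebook does it.

```python
import pandas as pd

col = "Earthquake : Injuries Description"

# Sample labels as they appear in the value counts above.
df = pd.DataFrame(
    {col: ["Few (~1 to 50 deaths)", "No injuries", "Many (~101 to 1000 deaths)"]}
)

# regex=False treats "deaths" as a plain substring, so the parentheses and
# tildes in the labels need no escaping.
df[col] = df[col].str.replace("deaths", "injuries", regex=False)

print(df[col].tolist())
```

The "No injuries" label contains no match and is left unchanged.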
<AxesSubplot: title={'center': 'Number of injuries from earthquakes per year'}, xlabel='Year'>
This chart has even bigger peaks than the one for deaths, but the overall pattern is the same.
Text(0.5, 1.0, 'Relationship between number of injuries and magnitude')
This is again very similar to the same chart for deaths.
This chart is similar to the one for deaths data, but now we have more earthquakes in the region with a big number of injuries.
It now seems that there is more overlap between the categories with more injuries and the category with null values.
Deaths null: 1372
Injuries null: 1347
Both null: 1010
As we can see, most null values overlap, so since we treated null values for deaths as zeros, we can do the same for injuries.
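The overlap can be measured by combining the two boolean null masks. With the printed counts, 1010 of the 1347 null injury values (about 75%) also have a null death count. The sketch below shows the mechanics on a small invented frame.

```python
import numpy as np
import pandas as pd

# Invented frame with a mix of missing deaths and injuries.
df = pd.DataFrame(
    {
        "Earthquake : Deaths": [np.nan, np.nan, 3.0, np.nan, 10.0],
        "Earthquake : Injuries": [np.nan, 5.0, np.nan, np.nan, 40.0],
    }
)

deaths_null = df["Earthquake : Deaths"].isna()
injuries_null = df["Earthquake : Injuries"].isna()

# Element-wise AND keeps only rows where both values are missing.
both_null = int((deaths_null & injuries_null).sum())

print("Deaths null:", int(deaths_null.sum()))
print("Injuries null:", int(injuries_null.sum()))
print("Both null:", both_null)

# Share of the printed null injuries that are also null for deaths.
overlap_share = 1010 / 1347
```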
<AxesSubplot: title={'center': 'Number of injuries per country (15 countries with greatest number)'}, xlabel='Country'>
Here China dominates by far, but the conclusions are the same as for deaths.
Number of tsunamis: 572
Number of normal earthquakes: 1853
Tsunami percentage: 23.59%
So roughly every fourth earthquake caused a tsunami.
As expected, tsunamis occur near shorelines, and we can see a very big concentration of them in Japan and the Indonesian islands.
<AxesSubplot: title={'center': 'Number of tsunamies per year'}, xlabel='Year'>
Considering the number of tsunamis, we cannot see any obvious trend, but it definitely fluctuates from year to year.
<AxesSubplot: title={'center': 'Countries with most tsunamis (15 countries with greatest number)'}, xlabel='Country'>
Now Japan has the most tsunamis, followed by Indonesia and Russia. This is quite different from the general earthquake counts, because different regions are affected.
<AxesSubplot: title={'center': 'Countries with most deaths caused by tsunamis (15 countries with greatest number)'}, xlabel='Country'>
So Haiti is first because of the 2010 earthquake there, which is flagged as a tsunami event. We can also see that Japan is only in 10th place here, although it is the country with the most tsunamis. That is probably due to very good protection against tsunamis and earthquakes in that country. Russia is also only in 14th place, while it was in 3rd place by number of tsunamis.
Average deaths from tsunamis: 3688.497005988024
Average deaths from normal earthquakes: 711.244920993228
Max deaths from tsunamis: 316000.0
Max deaths from normal earthquakes: 242769.0
So on average tsunamis caused more deaths than normal earthquakes, and the maximum value also belongs to a tsunami event (316000 against 242769 for normal earthquakes).
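Statistics like the four printed above can be produced in a single grouped aggregation on the tsunami flag. The sketch uses invented death counts and hypothetical variable names. Missing values are skipped by `mean` and `max`, so they do not count as zeros here.

```python
import numpy as np
import pandas as pd

# Invented rows: True = tsunami-flagged event, False = normal earthquake.
df = pd.DataFrame(
    {
        "Flag Tsunami": [True, True, False, False, False],
        "Earthquake : Deaths": [100.0, 900.0, 10.0, np.nan, 50.0],
    }
)

# Mean and maximum deaths per group; NaN entries are ignored.
summary = df.groupby("Flag Tsunami")["Earthquake : Deaths"].agg(["mean", "max"])
print(summary)
```

Note that whether null values are treated as missing or as zeros changes the means considerably, so the choice should be stated next to any such comparison.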
<AxesSubplot: title={'center': 'Number of deaths distribution for tsunamies'}, ylabel='Frequency'>
This still looks similar to the distribution for normal earthquakes, but it is a little more spread out.
<AxesSubplot: xlabel='Year'>
So we can see that recently (around 2005-2010) tsunamis caused more deaths than normal earthquakes. Let's find out which tsunamis caused that.
| Flag Tsunami | Year | Month | Day | Focal Depth | EQ Primary | Country | State | Location name | Earthquake : Deaths | Earthquake : Deaths Description | Earthquake : Injuries | Earthquake : Injuries Description | Earthquake : Damage Description | Latitude | Longitude | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID Earthquake | |||||||||||||||||
| 8732 | True | 2010 | 1 | 12 | 13.0 | 7.0 | HAITI | NaN | HAITI: PORT-AU-PRINCE | 316000.0 | Very Many (~1001 or more deaths) | 30000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 18.457 | -72.533 | 2010-01-12 |
| 7843 | True | 2008 | 5 | 12 | 19.0 | 7.9 | CHINA | NaN | CHINA: SICHUAN PROVINCE | 87652.0 | Very Many (~1001 or more deaths) | 374171.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 31.002 | 103.322 | 2008-05-12 |
| 4531 | True | 1970 | 5 | 31 | 43.0 | 7.9 | PERU | NaN | PERU: NORTHERN, PISCO, CHICLAYO | 66794.0 | Very Many (~1001 or more deaths) | 50000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | -9.200 | -78.800 | 1970-05-31 |
| 5248 | True | 1990 | 6 | 20 | 19.0 | 7.3 | IRAN | NaN | IRAN: RASHT, QAZVIN, ZANJAN, RUDBAR, MANJIL | 40000.0 | Very Many (~1001 or more deaths) | 105000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 36.957 | 49.409 | 1990-06-20 |
| 4711 | True | 1976 | 2 | 4 | 5.0 | 7.5 | GUATEMALA | NaN | GUATEMALA: CHIMALTENANGO, GUATEMALA CITY | 23000.0 | Very Many (~1001 or more deaths) | 76000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 15.324 | -89.101 | 1976-02-04 |
| 5527 | True | 1999 | 8 | 17 | 13.0 | 7.6 | TURKEY | NaN | TURKEY: ISTANBUL, KOCAELI, SAKARYA | 17118.0 | Very Many (~1001 or more deaths) | 50000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 40.760 | 29.970 | 1999-08-17 |
| 4216 | True | 1960 | 2 | 29 | 33.0 | 5.9 | MOROCCO | NaN | MOROCCO: AGADIR | 13100.0 | Very Many (~1001 or more deaths) | 25000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 30.450 | -9.620 | 1960-02-29 |
| 5076 | True | 1985 | 9 | 19 | 28.0 | 8.1 | MEXICO | NaN | MEXICO: MICHOACAN: MEXICO CITY | 9500.0 | Very Many (~1001 or more deaths) | 30000.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 18.190 | -102.533 | 1985-09-19 |
| 5399 | True | 1995 | 1 | 16 | 22.0 | 6.9 | JAPAN | NaN | JAPAN: SW HONSHU: KOBE, AWAJI-SHIMA, NISHINO... | 5502.0 | Very Many (~1001 or more deaths) | 36896.0 | Very Many (~1001 or more deaths) | EXTREME (~$25 million or more) | 34.583 | 135.018 | 1995-01-16 |
| 4671 | True | 1974 | 12 | 28 | 22.0 | 6.2 | PAKISTAN | NaN | PAKISTAN: BALAKOT, PATAN | 5300.0 | Very Many (~1001 or more deaths) | 17000.0 | Very Many (~1001 or more deaths) | MODERATE (~$1 to $5 million) | 35.100 | 72.900 | 1974-12-28 |
Now we can see that the Haiti earthquake is indeed flagged as a tsunami event, so it contributed to these values.
Tsunamis with extreme damage: 63
Normal earthquakes with extreme damage: 200
damage percentage: 31.5
So a tsunami is only marginally more likely to cause extreme damage than a normal earthquake: 63 of 572 tsunamis (about 11.0%) against 200 of 1853 normal earthquakes (about 10.8%).
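The comparison is easier to read as a rate per group than as a ratio of counts. The calculation below uses only the counts printed in this section and assumes that the 572 tsunamis and 1853 normal earthquakes are the same population as the extreme-damage counts. The printed 31.5 is reproduced as the ratio of the two extreme-damage counts.

```python
# Counts copied from the printed outputs in this section.
tsunamis_total, normal_total = 572, 1853
tsunamis_extreme, normal_extreme = 63, 200

# Share of each group that ended in the extreme damage category.
tsunami_rate = tsunamis_extreme / tsunamis_total * 100   # about 11.0 %
normal_rate = normal_extreme / normal_total * 100        # about 10.8 %

# The printed "damage percentage" matches extreme tsunamis relative to
# extreme normal earthquakes, not a share of all extreme events.
printed_ratio = tsunamis_extreme / normal_extreme * 100

print(f"{tsunami_rate:.1f}% vs {normal_rate:.1f}%, ratio {printed_ratio:.1f}")
```

The two rates differ by only a fraction of a percentage point, so the difference should be read as small.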